findspam.py: body_text_repeated(): phrase repeated at beginning of body #7002

tripleee · 2022-05-19T07:32:20Z

Second PR, had to back out the broken one I committed yesterday.

test/test_findspam.py: reinstate backed-out test case

Second PR, had to back out the broken one I committed yesterday. test/test_findspam.py: reinstate backed-out test case

tripleee · 2022-05-19T07:34:01Z

The forced anchoring to the beginning of the post is probably something I would like to extend at some point. I don't think many posts have repeated text in the middle (though certainly that happens too) but having a lot of repeats at the end is definitely something I would like to catch, too. I'm thinking it should trigger the same rule, perhaps with a tweak to the detailed "why" message.

tripleee · 2022-05-19T12:23:20Z

I have tested on my laptop for several hours in https://chat.stackexchange.com/rooms/136393/room-for-tripleee-and-ad-oisee and this seems to do what it's supposed to do now.

makyen · 2022-05-20T06:18:40Z

This looks quite a bit better. Using similar regex101 tests (first, second, third) produce a much better final regex, \A(what[\W_]+is[\W_]+your[\W_]+name[\W_]*){10,}, which doesn't match just anything. :)

I wonder how it will perform on a post where the user has a fairly redundant dataset as the first thing in their post. However, in practice, that probably won't be all that many FP hits.

Unfortunately, however, the meta-concerns and side-effects I expressed elsewhere still seem quite valid. Basically, given how things are, there's quite a bit of stuff which needs to change in MS prior to our actually being able to merge this and use it. Thus, I'm going to, reluctantly, tag it as status: deferred.

I'll see if I can take a better/good look at what needs to change in MS.

makyen · 2022-05-21T20:19:49Z

findspam.py

+    if not initial_words:
+        return False, ""
+    escaped_initial_words = [regex.escape(x) for x in initial_words.groups()]
+    period = regex.match(


Please add regex.cache_all(False) prior to this regex.match() and regex.cache_all(True) after it to prevent the single-use regexes compiled here from filling the regex package's internal cache of compiled regular expressions.

Note: this is probably something which we should be doing in a substantial number of places in the code, but I was reminded when creating PR #7012 that it should be used here. Basically, we should be doing that wherever we use a regex which is unique per-post. We should also change things in individual functions/modules to separately store compiled regexes which are used repeatedly in order to reduce the number which we rely on the regex package to cache.

I've struck-out my earlier comments here and closed an open PR I had, as regex.cache_all() is the wrong way to handle this here, or anywhere in SD, due to threading. I'll be writing a helper function which can be used for explicit regex compiles to force the implicit regex cache to not be used. I'll update this again once I've created that (soon). I'll also create an issue in the regex package asking to expose an option to be able to not use the implicit cache on a per-compile basis.

I've opened a PR, #7025, which adds regex_compile_no_cache() to helpers.py and imports it to findspam.py. That function can be used to compile a regular expression and not have it placed in the regex package's implicit cache. Using that function, the code here could be something like:

period_regex = regex_compile_no_cache( r"\A%s[\W_]+%s[\W_]+%s[\W_]+(.{1,40}?)%s[\W_]+%s[\W_]+%s(?=$|[\W_])" % ( tuple(escaped_initial_words * 2)) period = period_regex.match(s) if not period: return False, "" period_words = regex.split(r"[\W_]+", period.groups(0)[0]) escaped_words = escaped_initial_words + [ regex.escape(x) for x in period_words] repeats_regex = regex_compile_no_cache(r"\A(" + r"[\W_]+".join(escaped_words) + r"[\W_]*){10,}") repeats = repeats_regex.match(s)

makyen · 2022-05-21T20:20:58Z

findspam.py

+    escaped_words = escaped_initial_words + [
+        regex.escape(x) for x in period_words]
+    repeats_regex = r"\A(" + r"[\W_]+".join(escaped_words) + r"[\W_]*){10,}"
+    repeats = regex.match(repeats_regex, s)


~~Same request here regarding regex.cache_all(False) prior to this regex.match() and regex.cache_all(True) after it.~~

~~Obviously, it would be possible to use only a single pair of regex.cache_all(False) and regex.cache_all(True) to cover both post-specific uses of a regular expression.~~

This also touches on the larger issue of how we use the regex package's cache, particularly within the detections. The larger issue is that we don't want to be using it for regexes which are individually generated specific to each post, but we do want any static regular expressions not to have to be recompiled for every use (e.g. through using the regex package's cache). There are a couple of different approaches which we could take in general. I'll open an Issue to cover/discuss the topic.

findspam.py: body_text_repeated(): phrase repeated at beginning of body

80ad5bf

Second PR, had to back out the broken one I committed yesterday. test/test_findspam.py: reinstate backed-out test case

tripleee requested a review from makyen May 19, 2022 12:23

makyen added area: spamchecks Detections or the process of testing posts. (No space in the label, is because of Hacktoberfest) status: deferred 6-8 longer weeks. labels May 20, 2022

makyen requested changes May 21, 2022

View reviewed changes

makyen added the status: confirmed Confirmed as something that needs working on. label Jun 21, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

findspam.py: body_text_repeated(): phrase repeated at beginning of body #7002

findspam.py: body_text_repeated(): phrase repeated at beginning of body #7002

tripleee commented May 19, 2022

tripleee commented May 19, 2022

tripleee commented May 19, 2022

makyen commented May 20, 2022

makyen May 21, 2022 •

edited

Loading

makyen May 25, 2022

makyen May 26, 2022

makyen May 21, 2022 •

edited

Loading

makyen May 21, 2022 •

edited

Loading

findspam.py: body_text_repeated(): phrase repeated at beginning of body #7002

Are you sure you want to change the base?

findspam.py: body_text_repeated(): phrase repeated at beginning of body #7002

Conversation

tripleee commented May 19, 2022

tripleee commented May 19, 2022

tripleee commented May 19, 2022

makyen commented May 20, 2022

makyen May 21, 2022 • edited Loading

Choose a reason for hiding this comment

makyen May 25, 2022

Choose a reason for hiding this comment

makyen May 26, 2022

Choose a reason for hiding this comment

makyen May 21, 2022 • edited Loading

Choose a reason for hiding this comment

makyen May 21, 2022 • edited Loading

Choose a reason for hiding this comment

makyen May 21, 2022 •

edited

Loading

makyen May 21, 2022 •

edited

Loading

makyen May 21, 2022 •

edited

Loading